Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make memorynew intrinsic #55913

Open
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

oscardssmith
Copy link
Member

@oscardssmith oscardssmith commented Sep 28, 2024

This speeds up making new Memorys and allow the compiler to better understand what's going on, allowing for LLVM level escape analysis in some cases. There is more room to grow this (currently this only optimizes for fairly small Memory since bigger ones would require writing some more LLVM code, and we probably want a size limit on putting Memory on the stack to avoid stackoverflow. For larger ones, we could potentially inline the free so the Memory doesn't have to be swept by the GC, etc.

Benchmarks:

julia> f(T) = Memory{T}(undef, 2);

julia> @btime f(Int);
  7.879 ns (1 allocation: 48 bytes) #before
  3.959 ns (1 allocation: 48 bytes) #after 

julia> @btime f(Union{Int, Nothing})
  8.426 ns (1 allocation: 48 bytes) #before
  4.017 ns (1 allocation: 48 bytes) #after 

julia> @btime f(Any);
  8.637 ns (1 allocation: 48 bytes) #before
  3.920 ns (1 allocation: 48 bytes) #after 

julia> function g()
           m = Memory{Int}(undef, 2)
           for i in 1:2
              m[i] = i
           end
           m[1]+m[2]
       end

julia> @btime g()
  9.735 ns (1 allocation: 48 bytes) #before
  1.719 ns (0 allocations: 0 bytes) #after

src/cgutils.cpp Outdated Show resolved Hide resolved
@nsajko nsajko added the arrays [a, r, r, a, y, s] label Sep 28, 2024
@oscardssmith oscardssmith added the performance Must go faster label Sep 29, 2024
@oscardssmith
Copy link
Member Author

@gbaraldi so with LLVM assertions enabled I'm getting

julia: /home/oscardssmith/Documents/Code/julia/deps/srccache/llvm-julia-18.1.7-2/llvm/lib/IR/Instructions.cpp:685: void llvm::CallInst::init(llvm::FunctionType*, llvm::Value*, llvm::ArrayRef<llvm::Value*>, llvm::ArrayRef<llvm::OperandBundleDefT<llvm::Value*> >, const llvm::Twine&): Assertion `(i >= FTy->getNumParams() || FTy->getParamType(i) == Args[i]->getType()) && "Calling a function with a bad signature!"' failed.

[398134] signal 6 (-6): Aborted
in expression starting at REPL[3]:1
pthread_kill at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
raise at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x7108b6c2871a)
__assert_fail at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
_ZN4llvm8CallInst4initEPNS_12FunctionTypeEPNS_5ValueENS_8ArrayRefIS4_EENS5_INS_17OperandBundleDefTIS4_EEEERKNS_5TwineE at /home/oscardssmith/Documents/Code/julia/usr/bin/../lib/libLLVM.so.18.1jl (unknown line)
CallInst at /home/oscardssmith/Documents/Code/julia/usr/include/llvm/IR/Instructions.h:1692 [inlined]
Create at /home/oscardssmith/Documents/Code/julia/usr/include/llvm/IR/Instructions.h:1540 [inlined]
CreateCall at /home/oscardssmith/Documents/Code/julia/usr/include/llvm/IR/IRBuilder.h:2398
CreateCall at /home/oscardssmith/Documents/Code/julia/usr/include/llvm/IR/IRBuilder.h:2420 [inlined]
emit_memorynew at /home/oscardssmith/Documents/Code/julia/src/cgutils.cpp:4555 [inlined]

which is on the line that does ctx.builder.CreateCall(prepare_call(jl_alloc_genericmemory_unchecked_func), {ptls, nbytes, cg_typ});. Any ideas?

@gbaraldi
Copy link
Member

gbaraldi commented Oct 4, 2024

I'd print everyone involved here with the way I showed you yesterday thing->print(dbgs()), So print the func and the arguments

@oscardssmith
Copy link
Member Author

This now works! For simple examples like Memory{Int}(undef, 3) it's about 7ns compared to 11 previously. Also this version is almost able to be analyzed by our escape analysis to delete the Memory when it doesn't (we just need to teach the llvm-alloc-helpers about gc_loaded, and for non @inbounds accesses we need to teach LLVM that the boundschecks are dead).

@gbaraldi
Copy link
Member

gbaraldi commented Oct 4, 2024

As an example of what is possible.

Allocopt was able to go from

define i64 @julia_f_769() #0 !dbg !5 {
top:
  %pgcstack = call ptr @julia.get_pgcstack()
  %current_task1 = getelementptr inbounds i8, ptr %pgcstack, i64 -112, !dbg !14
  %memoryref_mem = call dereferenceable(40) ptr addrspace(10) @julia.gc_alloc_obj(ptr nonnull %current_task1, i64 40, ptr addrspace(10) addrspacecast (ptr @"+Core.GenericMemory#771.jit" to ptr addrspace(10))), !dbg !14
  %0 = addrspacecast ptr addrspace(10) %memoryref_mem to ptr addrspace(11), !dbg !14
  %1 = getelementptr inbounds { i64, ptr }, ptr addrspace(11) %0, i64 0, i32 1, !dbg !14
  %2 = call nonnull ptr @julia.pointer_from_objref(ptr addrspace(11) %0) #4, !dbg !14
  %3 = getelementptr inbounds i8, ptr %2, i64 16, !dbg !14
  store ptr %3, ptr addrspace(11) %1, align 8, !dbg !14
  store i64 3, ptr addrspace(11) %0, align 8, !dbg !14
  %memoryref_data4 = call ptr addrspace(13) @julia.gc_loaded(ptr addrspace(10) %memoryref_mem, ptr %3), !dbg !15
  store i64 2, ptr addrspace(13) %memoryref_data4, align 8, !dbg !15, !tbaa !20, !alias.scope !24, !noalias !27
  %memoryref_data11 = getelementptr inbounds i8, ptr addrspace(13) %memoryref_data4, i64 8, !dbg !32
  store i64 4, ptr addrspace(13) %memoryref_data11, align 8, !dbg !32, !tbaa !20, !alias.scope !24, !noalias !27
  %memoryref_data18 = getelementptr inbounds i8, ptr addrspace(13) %memoryref_data4, i64 16, !dbg !34
  store i64 5, ptr addrspace(13) %memoryref_data18, align 8, !dbg !34, !tbaa !20, !alias.scope !24, !noalias !27
  ret i64 11, !dbg !36
}

to. Removing the allocation. Which likely would allow it to just return the 11

define i64 @julia_f_769() #0 !dbg !5 {
top:
  %memoryref_mem = alloca [40 x i8], align 16
  %pgcstack = call ptr @julia.get_pgcstack()
  %current_task1 = getelementptr inbounds i8, ptr %pgcstack, i64 -112, !dbg !14
  call void @llvm.lifetime.start.p0(i64 40, ptr %memoryref_mem)
  %0 = freeze [40 x i8] undef, !dbg !14
  store [40 x i8] %0, ptr %memoryref_mem, align 1, !dbg !14
  %1 = getelementptr inbounds { i64, ptr }, ptr %memoryref_mem, i64 0, i32 1, !dbg !14
  %2 = getelementptr inbounds i8, ptr %memoryref_mem, i64 16, !dbg !14
  store ptr %2, ptr %1, align 8, !dbg !14
  store i64 3, ptr %memoryref_mem, align 8, !dbg !14
  %memoryref_data4 = call ptr addrspace(13) @julia.gc_loaded(ptr addrspace(10) null, ptr %2), !dbg !15
  store i64 2, ptr addrspace(13) %memoryref_data4, align 8, !dbg !15, !tbaa !20, !alias.scope !24, !noalias !27
  %memoryref_data11 = getelementptr inbounds i8, ptr addrspace(13) %memoryref_data4, i64 8, !dbg !32
  store i64 4, ptr addrspace(13) %memoryref_data11, align 8, !dbg !32, !tbaa !20, !alias.scope !24, !noalias !27
  %memoryref_data18 = getelementptr inbounds i8, ptr addrspace(13) %memoryref_data4, i64 16, !dbg !34
  store i64 5, ptr addrspace(13) %memoryref_data18, align 8, !dbg !34, !tbaa !20, !alias.scope !24, !noalias !27
  ret i64 11, !dbg !36
}

@oscardssmith
Copy link
Member Author

julia> @btime Memory{Int8}(undef, 3)
  7.169 ns (1 allocation: 32 bytes) # before
  3.986 ns (1 allocation: 32 bytes) # after

@giordano
Copy link
Contributor

giordano commented Oct 4, 2024

FixedSizeArrays.jl is going to be much nicer 🙂

src/cgutils.cpp Outdated Show resolved Hide resolved
oscardssmith added a commit that referenced this pull request Oct 7, 2024
combined with #55913, the compiler is smart enough to fully remove
```
function f()
    m = Memory{Int}(undef, 3)
    @inbounds m[1] = 2
    @inbounds m[2] = 2
    @inbounds m[3] = 4
    @inbounds return m[1] + m[2] + m[3]
end
```
src/ccall.cpp Outdated Show resolved Hide resolved
oscardssmith added a commit that referenced this pull request Oct 10, 2024
combined with #55913, the
compiler is smart enough to fully remove
```
function f()
    m = Memory{Int}(undef, 3)
    @inbounds m[1] = 2
    @inbounds m[2] = 2
    @inbounds m[3] = 4
    @inbounds return m[1] + m[2] + m[3]
end
```
@giordano
Copy link
Contributor

Can you please add an llvm pass test for #56030 (comment) (removing all memory for a simple case where the Memory object doesn't escape)?

@oscardssmith
Copy link
Member Author

Do you want an actual LLVM pass, or can I just write a test for 0 allocations?

@giordano
Copy link
Contributor

I think an llvm test would be more robust, but probably a simple zero-allocation test would do the job as well.

@oscardssmith
Copy link
Member Author

LOL. This test is so good it broke a doctest in performance tips. We're testing to show that you get allocations if you have "bad" code that allocates arrays, but now it doesn't allocate :laughing

@oscardssmith
Copy link
Member Author

This is now on top of #55995 (to figure out why we weren't optimizing correctly), but other than that, I think this is good to go!

@giordano
Copy link
Contributor

Maybe a test of no allocations in simple cases as discussed above? 🙂

@MasonProtter
Copy link
Contributor

MasonProtter commented Nov 1, 2024

Spooky! 👻 🎃

@oscardssmith
Copy link
Member Author

@aviatesk any idea why the effects tests are failing?

@aviatesk
Copy link
Member

aviatesk commented Nov 6, 2024

@aviatesk any idea why the effects tests are failing?

I guess this issue has been resolved by implementing the effects modeling for the new builtin?

@oscardssmith
Copy link
Member Author

yeah, I hadn't realized how good the effects for this intrinsic are.

@oscardssmith
Copy link
Member Author

actually CI thinks stuff is still very broken (which is weird given that the test passed locally)

@MasonProtter
Copy link
Contributor

So currently this appears to only allow for stack allocation when the size of the Memory is statically known at compile time within the function body. Is there hope for this to happen with dynamically sized Memory, or is that significantly harder to do here?

@nsajko
Copy link
Contributor

nsajko commented Nov 6, 2024

But stack size is limited. Stack-allocating when the size isn't known at compile time seems dangerous?

@MasonProtter
Copy link
Contributor

MasonProtter commented Nov 6, 2024

Could have runtime checks on the size.

@oscardssmith
Copy link
Member Author

the dynamic version would be great, but would add an extra ~100 lines of coffee since you would have to make the basic blocks to handle both sides of the allocation

@MasonProtter
Copy link
Contributor

MasonProtter commented Nov 6, 2024

Mhm. What I do with the SlabBuffer in Bumper.jl is at the point of allocation, I do something like

if fits_in_slab(sz, slab)
    p = alloca_like_path(sz, slab)
else
    p = @noinline malloc_like_path(sz)
end

with the cleanup phase at the end of the region then either resetting the slab, or freeing the malloc. The more I write this though the more I realize how out of scope this is for this PR so maybe we should discuss elsewhere.

@gbaraldi
Copy link
Member

gbaraldi commented Nov 6, 2024

So we can probably do something better if we can prove that the dynamically sized array doesn't escape. https://github.com/JuliaLang/julia/pull/52382/files started some work on this. We could for example have it be malloc, it being a stack allocation is difficult because VLAs are sketchy, so we would need to do something like allocate a small buffer and switch to malloc, the extra branch makes me think it should just be a malloc.

@oscardssmith oscardssmith force-pushed the os/make-memorynew-tfunc branch 2 times, most recently from 103c575 to e86f702 Compare November 26, 2024 19:05
@giordano
Copy link
Contributor

The doctest failure

│ Subexpression:
│ 
│ for i in [1,4,0]
│     println(i)
│ end
│ 
│ Evaluated output:
│ 
│ 1
│ 1
│ 24820
│ 
│ Expected output:
│ 
│ 1
│ 4
│ 0

is concerning, since a simple for loop is giving completely wrong results. The doctest failure about allocation disappearing is much more welcome instead.

@oscardssmith
Copy link
Member Author

yeah. something in alloc_opt is lying about the lifetime of the memory leading it to be removed spuriously. Gabriel and I are looking into it.

gbaraldi and others added 13 commits November 27, 2024 23:03
The Value printer LLVM uses just prints the kind of instruction so it just shows call.

Update llvm-alloc-opt.cpp
no segfault

empty mem optimization

optimized!

fix zeroinit

it works

It's alive!

use checked rather than overflow

fix error path

sink ptls

debugs

fixes

fix optimization

typo

cleanup and test

mark some tests broken

mark some as broken

move boundscheck to julia and improve codegen

fix-bootsrap

off by 1

update overflow message

switch wideint_t to overflowing operations

address review

Co-authored-by: Jameson Nash <[email protected]>

Update src/codegen.cpp

Co-authored-by: Jameson Nash <[email protected]>

Update src/codegen.cpp

Co-authored-by: Jameson Nash <[email protected]>

Update src/codegen.cpp

Co-authored-by: Jameson Nash <[email protected]>

Update src/builtins.c

Co-authored-by: Jameson Nash <[email protected]>

Update src/codegen.cpp

Co-authored-by: Jameson Nash <[email protected]>

adress review

fix tbaa

fix issues

add test

fix build on clang

fix
Co-authored-by: Jeff Bezanson <[email protected]>
Co-authored-by: Jeff Bezanson <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrays [a, r, r, a, y, s] performance Must go faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants